Exploratory Analysis of White wine quality by Pooja Joshi

White wine is a wine whose colour can be straw-yellow, yellow-green, or yellow-gold. It is produced by the alcoholic fermentation of the non-coloured pulp of grapes, which may have a skin of any colour. White wine has existed for at least 2500 years.

Citation Request: This dataset is publicly available for research. The details are described in [Cortez et al., 2009].

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent).

Attribute information:

Input variables (based on physicochemical tests):

Description of attributes:

Output variable (based on sensory data):

Univariate Plots Section

##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.0             0.27        0.36           20.7     0.045
## 2 2           6.3             0.30        0.34            1.6     0.049
## 3 3           8.1             0.28        0.40            6.9     0.050
## 4 4           7.2             0.23        0.32            8.5     0.058
## 5 5           7.2             0.23        0.32            8.5     0.058
## 6 6           8.1             0.28        0.40            6.9     0.050
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  45                  170  1.0010 3.00      0.45     8.8
## 2                  14                  132  0.9940 3.30      0.49     9.5
## 3                  30                   97  0.9951 3.26      0.44    10.1
## 4                  47                  186  0.9956 3.19      0.40     9.9
## 5                  47                  186  0.9956 3.19      0.40     9.9
## 6                  30                   97  0.9951 3.26      0.44    10.1
##   quality
## 1       6
## 2       6
## 3       6
## 4       6
## 5       6
## 6       6
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
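The head() output above still lists the row-index column X, while the str() output reports 12 variables, so the index was presumably dropped in between. The step itself is not shown, so the following is only a minimal sketch of one way to do it. It assumes the data frame is named wine (the name that appears in the by() output later in this report) and rebuilds just two rows and a few columns for illustration.

```r
# Toy stand-in for the loaded data: the first two rows of the head() output,
# restricted to a few columns. The real data frame has 4898 rows.
wine <- data.frame(
  X             = c(1, 2),
  fixed.acidity = c(7.0, 6.3),
  density       = c(1.0010, 0.9940),
  quality       = c(6L, 6L)
)

# Drop the row-index column; it carries no information about the wine itself.
wine$X <- NULL

names(wine)
```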

Univariate Analysis

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Acids are major wine constituents and contribute greatly to its taste. In fact, acids impart the sourness or tartness that is a fundamental feature of wine taste. Wines lacking in acid are “flat.”

High levels of volatile acidity lead to an unpleasant vinegar taste.

Citric acid is found in very small quantities and adds freshness to the wine.

After adjusting the bin widths and breaks, I ended up with these plots. Fixed acidity, volatile acidity, and citric acid are approximately normally distributed. I could not explain the unusual peak in the citric acid distribution at 0.49. Fixed acidity contributes more to white wine than volatile acidity or citric acid, which is reasonable because a high amount of volatile acidity makes wine taste like vinegar.

The summary data show that fixed acidity has a mean of 6.855 g/dm^3 and a median of 6.800 g/dm^3. The means of volatile acidity and citric acid are 0.2782 g/dm^3 and 0.3342 g/dm^3, and their medians are 0.2600 g/dm^3 and 0.3200 g/dm^3, respectively. The mean and median are close to each other for all three variables, which suggests that outliers have not pulled the means far. It would be interesting to check whether their contribution changes with quality.

Let’s move ahead to the distribution of alcohol content in wine.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

The distribution of alcohol content is trimodal (multimodal). Most wines have an alcohol content between 8.5% and 14.0%, while the summary above gives the full range as 8.00% to 14.20%. The mean alcohol content is 10.51% and the median is 10.40%.

Let us now look at the distribution of residual sugar.

During winemaking, yeast consumes sugar and produces ethanol (alcohol) as a by-product. I wonder what the relation between alcohol and sugar looks like.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Residual sugar has a few outliers, but they have not dragged the mean far from the median. The distribution also has a long tail, so the x axis should be put on a log scale to visualize it better.
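To see why a log scale helps with a long right tail, here is a small sketch that uses only the five-number summary of residual sugar printed above. It compares how far the maximum sits from the third quartile, measured in units of the interquartile range, before and after a log10 transform. The variable names are illustrative and are not taken from the original analysis.

```r
# Five-number summary of residual sugar (g/dm^3), copied from the summary() output.
sugar <- c(min = 0.6, q1 = 1.7, median = 5.2, q3 = 9.9, max = 65.8)

# On the raw scale the maximum sits many interquartile ranges above Q3.
raw_gap <- (sugar[["max"]] - sugar[["q3"]]) / (sugar[["q3"]] - sugar[["q1"]])

# After a log10 transform the same gap shrinks relative to the middle 50%.
log_sugar <- log10(sugar)
log_gap <- (log_sugar[["max"]] - log_sugar[["q3"]]) /
  (log_sugar[["q3"]] - log_sugar[["q1"]])

round(c(raw = raw_gap, log10 = log_gap), 2)
```

In ggplot2, adding scale_x_log10() to a histogram applies the same transform to the axis without changing the stored values.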

The distribution is bimodal. As mentioned in the introduction, it is rare to find wines with less than 1 g/dm^3 of residual sugar, and the distribution above confirms this. The bulk of the data lies between 1.5 g/dm^3 and 17 g/dm^3. In the summary statistics the mean is 6.391 g/dm^3 and the median is 5.200 g/dm^3. The third quartile is 9.9 g/dm^3, meaning 75% of the wines fall below that value, so the maximum of 65.800 g/dm^3 is clearly an outlier.
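The outlier claim can be checked with a common rule of thumb, Tukey's 1.5 × IQR fence. This is an added convention and not necessarily the criterion used in the original analysis. The sketch below applies it to the quartiles from the summary output above.

```r
# Quartiles and maximum of residual sugar (g/dm^3) from the summary() output.
q1 <- 1.7
q3 <- 9.9
max_sugar <- 65.8

# Tukey's rule: values above Q3 + 1.5 * IQR are flagged as potential outliers.
iqr <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

is_outlier <- max_sugar > upper_fence
c(iqr = iqr, upper_fence = upper_fence)
```

With an upper fence of about 22.2 g/dm^3, the maximum of 65.8 g/dm^3 lies well outside it.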

Now moving ahead to density.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Density is approximately normally distributed, with outliers around 1.010 and 1.038. Apart from these, density ranges from 0.9871 g/cm^3 to 1.0031 g/cm^3. It is close to the density of water and depends on the alcohol and sugar content. I will look at the correlation between alcohol, sugar content, and density later in the analysis.

Moving our focus to sulfur dioxide.

Total sulfur dioxide is the combination of free sulfur dioxide and bound sulfur dioxide.

Alcoholic fermentation produces sulfur dioxide. I will look into the relation between alcohol and sulfur dioxide in a later section.

For now, I am creating a new feature variable called bound.sulfur.dioxide and looking at the distributions of all three.
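Since total sulfur dioxide is the sum of the free and bound parts, the new variable follows by subtraction. A minimal sketch, assuming the data frame is named wine and using the first four rows of the head() output shown earlier as toy data:

```r
# Free and total sulfur dioxide for the first four wines in the head() output.
wine <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound SO2 is the part of the total that is not free.
wine$bound.sulfur.dioxide <- wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

wine
```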

Total sulfur dioxide is approximately normally distributed, mostly between 10 and 250 ppm. Some outliers are observed beyond 400 ppm.

Free and bound sulfur dioxide deserve a closer look, but both distributions are fairly normal.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0    78.0   100.0   103.1   125.0   331.0

All three distributions are fairly normal. Alcoholic fermentation produces total sulfur dioxide in the range of 10 to 250 ppm (parts per million).

Bound sulfur dioxide contributes more to total sulfur dioxide: it ranges between 10 and 200 ppm, whereas free sulfur dioxide ranges between 1 and 80 ppm.

Let us look at the distribution of chlorides.

Chlorides indicate the amount of salt in the wine. Chloride concentration is influenced by terroir, but this dataset does not include terroir data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides in white wine mostly range from 0.00 to 0.1 g/dm^3. There are many outliers in the tail region, which squeezes the bulk of the distribution into a narrow band.

I wonder whether the chloride content changes with quality.

From a quick Google search, I understand that winemakers use pH as a way to measure ripeness in relation to acidity.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH ranges between 2.7 and 3.8. The pH level describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic).

It would be interesting to check which level of acidity receives the highest quality rating, so I will use the cut function to create acidity levels.
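The cut-offs used for the acidity levels are not shown in this report, so the break points below are hypothetical and chosen only to illustrate how cut() bins pH into the four levels named later (High, Moderately High, Medium, Low). Lower pH means higher acidity, which is why the labels run from "High" to "Low" as pH increases.

```r
# pH of the first four wines in the head() output shown earlier.
ph <- c(3.00, 3.30, 3.26, 3.19)

# Hypothetical break points (an assumption, not the author's values).
acidity.levels <- cut(
  ph,
  breaks = c(2.7, 3.0, 3.2, 3.4, 3.9),
  labels = c("High", "Moderately High", "Medium", "Low"),
  include.lowest = TRUE
)

data.frame(ph, acidity.levels)
```

By default cut() uses right-closed intervals, so a pH of exactly 3.0 lands in the first bin. The result is a factor whose levels are in the order given by labels.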

Let’s look at quality ratings.

The lowest quality rating is 3 and the highest is 9. Ratings of 5, 6, and 7 have far more data points than the remaining ratings, so I am creating a new feature variable, quality bucket, in which I try to distribute the ratings more evenly.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.878   6.000   9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
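The bucket boundaries are not stated in this report. As a sketch, the code below rebuilds the quality column from the frequency table above and groups it under one plausible assumption: ratings 3 to 5 are Low, 6 is Medium, and 7 to 9 are High. The grouping actually used may differ.

```r
# Rebuild the quality column from the frequency table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198, "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# Assumed grouping: (2,5] -> Low, (5,6] -> Medium, (6,9] -> High.
quality_bucket <- cut(
  quality,
  breaks = c(2, 5, 6, 9),
  labels = c("Low", "Medium", "High")
)

table(quality_bucket)
```

Under this assumption the buckets hold 1640, 2198, and 1060 wines, which is far more balanced than the seven raw ratings.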

What is the structure of your dataset?

  • The white wine dataset contains 4898 rows and 13 columns: a row index X plus 12 variables. The variables are fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality. Most of the feature variables are approximately normally distributed, namely:

  • 1. Fixed acidity

  • 2. Volatile acidity

  • 3. Citric acid

  • 4. Density

  • 5. Sulfur dioxide

  • 6. pH

  • 7. Chlorides

Alcohol has a trimodal distribution, whereas residual sugar has a bimodal distribution.

  • Other observations

The pH of the wines in our dataset ranges from 2.7 to 3.8, with a median of 3.18. Quality ratings range from 3 (lowest) to 9 (highest), and most observations have a quality rating of 6.

What is/are the main feature(s) of interest in your dataset?

  • The quality of wine depends mainly on alcohol content, residual sugar, acidity, and pH levels, so my focus will be on finding relationships among these predictors.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  • Features such as density, total sulfur dioxide, chlorides, sulphates, and fixed acidity might also be useful in predicting quality.

Did you create any new variables from existing variables in the dataset?

  • Yes. I created a variable named acidity levels by applying the cut function to pH and dividing it into four acidity levels: High, Moderately High, Medium, and Low. I also created a feature called quality_bucket, dividing the quality ratings into three buckets: Low, Medium, and High. Finally, because total sulfur dioxide = free sulfur dioxide + bound sulfur dioxide, I created the feature variable bound sulfur dioxide to check how it behaves with alcohol.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • Most feature variables are approximately normally distributed, except for alcohol content and residual sugar. The distribution of alcohol content was trimodal. Residual sugar had a long tail, so I used a log transformation to see its shape better; on the log scale it is bimodal.

Bivariate Plots Section

Let’s plot a correlogram to get an idea of the correlations between the feature variables.

I can see a very strong correlation between residual sugar and density, and also between alcohol and density. I will explore these strong relationships first, and then the other relationships mentioned in the univariate section.
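The numbers behind a correlogram are pairwise Pearson correlations, which base R computes with cor(). The sketch below uses only the six rows printed by head() earlier in this report, so it illustrates the direction of each relationship, not the strength measured on all 4898 wines.

```r
# First six rows of the head() output shown earlier in this report.
wine <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951)
)

# Pearson correlation matrix; the sign gives the direction of each relationship.
cor_matrix <- cor(wine)
round(cor_matrix, 2)
```

Even on this tiny sample, density is positively correlated with residual sugar and negatively correlated with alcohol.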

Residual sugar and density are strongly correlated: density increases with residual sugar. I wonder how residual sugar and density change with the wine’s quality rating.

As seen in the introduction, density depends on residual sugar content and alcohol. This correlation is very strong and implies that the density of wine increases as residual sugar increases.

A strong negative relation is observed between alcohol and density: density decreases as alcohol content increases. It would be interesting to check what happens to alcohol content as quality increases.

I did not expect a relation between density and sulfur dioxide, but it is moderately strong: density increases with total sulfur dioxide.

There are few data points for qualities 3, 4, 8, and 9, but even so density lies between 0.99 g/cm^3 and 1.00 g/cm^3.

Keep in mind that I am trying to find how density varies with quality. The scatterplot did not give clear results, so let’s move to a boxplot.

Density appears to decrease with quality, but I will zoom in for more detail.

## wine$quality_bucket: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9932  0.9951  0.9952  0.9971  1.0024 
## -------------------------------------------------------- 
## wine$quality_bucket: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## -------------------------------------------------------- 
## wine$quality_bucket: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9905  0.9917  0.9924  0.9936  1.0006
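The per-bucket summaries above have the layout that R's by() prints when summary() is applied within each group. A minimal sketch of such a call, using a toy data frame that holds only the minimum and maximum density of each bucket (taken from the output above) and not the full dataset:

```r
# Toy data: the minimum and maximum density of each bucket, from the output above.
wine <- data.frame(
  density = c(0.9872, 1.0024, 0.9876, 1.0390, 0.9871, 1.0006),
  quality_bucket = factor(
    rep(c("Low", "Medium", "High"), each = 2),
    levels = c("Low", "Medium", "High")
  )
)

# Apply summary() to density separately within each quality bucket.
density_by_bucket <- by(wine$density, wine$quality_bucket, summary)
density_by_bucket

# A single statistic per bucket, as a compact named vector.
bucket_max <- tapply(wine$density, wine$quality_bucket, max)
bucket_max
```

tapply() is convenient when only one statistic per group is needed, for example to compare medians across buckets on the real data.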

Yes, density clearly decreases as quality increases. Let’s move ahead and look at the correlation between chlorides and density.

The relation between chlorides and density is moderately strong. Do chlorides change with quality? I will look into it later in the analysis.

This relation is also moderately strong. Fixed acidity of white wine mostly lies between 6 g/dm^3 and 9 g/dm^3.

Mean alcohol content increases with quality. I would also like to check this effect with a boxplot.

As expected, alcohol content increases with quality.

Residual sugar decreases with quality. So is it the case that, as quality increases, white wines become more alcoholic and less sweet?

To be more sure, I will check it with a boxplot.

Yes, the pattern is clear: residual sugar decreases with quality. So I conclude that higher quality wines are more alcoholic and less sweet.

Moving ahead to check how fixed acidity varies with quality.

Fixed acidity does not appear to vary with quality: Low, Medium, and High quality wines all have fixed acidity between 5 g/dm^3 and 9 g/dm^3. I will also look at the box plot.

Let’s zoom in more.

## wine$quality_bucket: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.200   6.400   6.800   6.962   7.500  11.800 
## -------------------------------------------------------- 
## wine$quality_bucket: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.838   7.300  14.200 
## -------------------------------------------------------- 
## wine$quality_bucket: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.900   6.200   6.700   6.725   7.200   9.200

The median values are very close to each other, and I can’t see much difference between them. Fixed acidity does not change with quality.

For volatile acidity, the median values for the lowest qualities (3 and 4) are higher than those for qualities 8 and 9. I will plot this in more detail in the multivariate section.

Observed pH levels run from 2.8 to 3.6 irrespective of quality. I will draw a box plot for more information.

The median rises slightly with each quality level, but the increment is extremely small.

I will try to plot how acidity levels are distributed among the different qualities.

The relation between chlorides and alcohol is moderately strong but negative: as the chloride content of a wine increases, its alcohol content decreases.

Alcoholic fermentation produces sulfur dioxide, so I expected sulfur dioxide to increase with the percentage of alcohol, but the relation I found here runs the other way.

For alcohol between 8.5% and 10.5%, sulfur dioxide ranges from 100 ppm to 250 ppm, but as alcohol content increases further, sulfur dioxide is observed between 80 ppm and 120 ppm.

I wonder how the ratio of free sulfur dioxide to bound sulfur dioxide changes with alcohol.

For low as well as high alcohol content, free sulfur dioxide is roughly 0.4 times bound sulfur dioxide.
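The ratio discussed here is free sulfur dioxide divided by bound sulfur dioxide. A small sketch, using the first four rows of the head() output shown earlier plus the dataset-wide means from the univariate summaries (35.31 for free, 103.1 for bound). The variable names are illustrative.

```r
# Free and total sulfur dioxide for the first four wines in the head() output.
free  <- c(45, 14, 30, 47)
total <- c(170, 132, 97, 186)

# Bound SO2 is the part of the total that is not free.
bound <- total - free

# Ratio of free to bound SO2 for each wine.
free_to_bound <- free / bound
round(free_to_bound, 2)

# The same ratio computed from the dataset-wide means in the summaries.
mean_ratio <- 35.31 / 103.1
round(mean_ratio, 2)
```

The ratio of the means comes out at about 0.34, in the same neighbourhood as the roughly 0.4 read off the plot.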

The ratio of free to bound sulfur dioxide increases with quality.

The amount of chlorides decreases with higher quality. I would like to check the chlorides distribution across acidity levels.

## wine$acidity.levels: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0140  0.0350  0.0420  0.0479  0.0510  0.3010 
## -------------------------------------------------------- 
## wine$acidity.levels: Moderately High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03700 0.04400 0.04777 0.05175 0.27100 
## -------------------------------------------------------- 
## wine$acidity.levels: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04447 0.04900 0.20100 
## -------------------------------------------------------- 
## wine$acidity.levels: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01300 0.03400 0.04200 0.04257 0.04900 0.34600

The chlorides distribution does not change much with acidity levels. I am going to focus on how the other feature variables are distributed across acidity levels.

## wine$acidity.levels: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    18.0   105.0   132.0   136.4   167.8   366.5 
## -------------------------------------------------------- 
## wine$acidity.levels: Moderately High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    28.0   113.0   142.0   143.9   174.0   344.0 
## -------------------------------------------------------- 
## wine$acidity.levels: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    24.0   107.0   132.0   137.3   166.0   307.5 
## -------------------------------------------------------- 
## wine$acidity.levels: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   132.0   135.7   162.0   440.0

I don’t see any variation in total sulfur dioxide across acidity levels.

## wine$acidity.levels: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2300  0.3900  0.4600  0.4752  0.5400  1.0000 
## -------------------------------------------------------- 
## wine$acidity.levels: Moderately High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2500  0.4100  0.4700  0.4793  0.5200  1.0600 
## -------------------------------------------------------- 
## wine$acidity.levels: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4889  0.5400  1.0800 
## -------------------------------------------------------- 
## wine$acidity.levels: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2800  0.4300  0.5000  0.5181  0.5900  0.9700

## wine$acidity.levels: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.20   10.10   10.34   11.30   13.70 
## -------------------------------------------------------- 
## wine$acidity.levels: Moderately High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.38   11.20   14.20 
## -------------------------------------------------------- 
## wine$acidity.levels: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.50   10.50   10.63   11.43   14.05 
## -------------------------------------------------------- 
## wine$acidity.levels: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.90   10.50   10.73   11.50   14.00

## wine$acidity.levels: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.900   7.000   7.546  12.100  26.050 
## -------------------------------------------------------- 
## wine$acidity.levels: Moderately High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   2.025   6.400   6.895  10.200  31.600 
## -------------------------------------------------------- 
## wine$acidity.levels: Medium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.700   4.900   5.971   8.500  20.800 
## -------------------------------------------------------- 
## wine$acidity.levels: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.700   1.500   2.800   4.988   7.200  65.800

Residual sugar increases with acidity levels. The higher the acidity in a wine, the more residual sugar the wine can have.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • I have observed many interesting relationships among feature variables.

  • For high quality wine, alcohol content is high while residual sugar content is low. That means higher quality wines are more alcoholic and less sweet.

  • Density and chloride content decrease with higher quality.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • Residual sugar increases with acidity levels. The higher the acidity in a wine, the more residual sugar the wine can have.

  • Alcohol and residual sugar are inversely related: as one increases, the other decreases.

  • Alcoholic fermentation produces sulfur dioxide, yet total sulfur dioxide decreases with increasing alcohol content. On average, free sulfur dioxide is roughly 0.4 times bound sulfur dioxide for low as well as high alcohol content.

What was the strongest relationship you found?

  • The strongest relationships I found are among residual sugar, alcohol, and density. Residual sugar and density have a correlation of about 0.8, so density increases with residual sugar. Alcohol and density have a strong negative correlation of about -0.8, so density decreases with increasing alcohol content.

Multivariate Plots Section

Here I am going to revisit the plots from the bivariate section. First, I’ll explore how different feature variables vary with density and quality.

We concluded in the bivariate section that density increases with residual sugar. After adding jitter and transparency and changing the plot limits, the variation in residual sugar and density is visible. The plot also shows that, for the same level of residual sugar, high quality wines have lower density.

Next, I’ll look at how density vs alcohol varies with quality.

After adding jitter and transparency and changing the plot limits, the strong negative relation between alcohol and density is visible. For high quality wine, alcohol content is high whereas density is low.

After changing the plot limits, chlorides and density show a moderate positive correlation. For higher quality wine, both chloride content and density are low.

Next, I’ll look at the scatterplot of total sulfur dioxide and density.

After adding jitter and transparency and changing the plot limits, we can observe that for low quality wine both sulfur dioxide and density are high, whereas for higher quality wine both are low.

Moving ahead, I’ll plot a scatterplot of density and fixed acidity.

After adding jitter and transparency and changing the plot limits, we can observe that for low quality wine fixed acidity and density are higher.

Next, I’ll look at how alcohol varies with different feature variables.

After changing the plot limits, the plot above clearly shows that residual sugar and alcohol are inversely related. With increasing quality, alcohol content increases and residual sugar decreases.

After adding jitter and transparency and changing the plot limits, we can see that chloride content is high for low quality wine, while its alcohol content is low compared to high quality wine.

We know from the bivariate plots section that alcohol and total sulfur dioxide are negatively correlated. After adding jitter and transparency and changing the plot limits, we can see that for low quality wine sulfur dioxide is high and alcohol content is low.

Fixed acidity is between 5 g/dm^3 and 9 g/dm^3. I don’t see any variation in fixed acidity by quality, but again alcohol content is high for high quality wine.

Quality 6 seems to overplot a lot. I will remove it to see whether we get more useful information.
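Dropping one rating before plotting is a simple row filter. A minimal sketch, assuming the data frame is named wine and rebuilding only its quality column from the frequency table in the univariate section:

```r
# Rebuild the quality column from the frequency table shown earlier.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198, "7" = 880, "8" = 175, "9" = 5)
wine <- data.frame(quality = rep(as.integer(names(counts)), times = counts))

# Keep every rating except 6 to reduce overplotting.
wine_no6 <- subset(wine, quality != 6)

nrow(wine_no6)
```

This leaves 2700 of the 4898 wines, so any pattern seen in the filtered plot describes only the wines not rated 6.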

For quality 7, volatile acidity clearly increases with alcohol content. For the remaining qualities, the median values either stay the same as alcohol content increases or show no clear pattern.

Fixed acidity is between 5 g/dm^3 and 10 g/dm^3. Here too I don’t see any variation in acidity levels by quality, but again residual sugar content is low for high quality wine.

Let’s move ahead to see how alcohol content and residual sugar vary by acidity level.

We saw in the bivariate analysis that the higher the acidity in a wine, the more residual sugar the wine can have. This is clear here: as acidity level increases, residual sugar increases and alcohol content decreases.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  • I analysed these relations in the bivariate section, but differentiating them by quality and acidity levels gave clearer and more aesthetic plots.

  • Residual sugar, alcohol, and density had strong relationships, and splitting them by quality bucket made the observations clearer. For high quality wine, alcohol content is high while density and residual sugar content are low. In short, alcohol and residual sugar are inversely related.

  • For high quality wine, density, chlorides, total sulfur dioxide, and fixed acidity are low.

Were there any interesting or surprising interactions between features?

  • At high acidity levels, residual sugar content increases and alcohol content decreases. The higher the acidity in a wine, the more residual sugar the wine can have.

Final Plots and Summary

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Description One

To improve the plot, I used the quality feature as a factor variable, and I am happy with the final output. It clearly shows that pH is low for low quality wine, except for quality 3, and that pH increases for higher quality wine.

  • The red dotted line indicates the mean and the blue dotted line indicates the median.

Three major features affect quality: alcohol, residual sugar, and acidity levels. So I also plotted the other two against quality as a factor variable. The results are intuitive: for quality ratings below 6 alcohol content is low, and it increases sharply for high quality wine; the opposite holds for residual sugar.

Plot Two

Description Two

I haven’t shown a different relationship here, but I have tried to improve the scatter plot of alcohol and density. The two have a strong negative relation. I fitted a line and also plotted histograms of both feature variables using ggMarginal.
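The plotting code is not shown in this report, so the following is only a sketch of the idea behind fitting a line: an ordinary least-squares fit of density on alcohol with base R's lm(). It uses the six rows printed by head() earlier, so the slope illustrates the direction of the relationship and not the value obtained on the full dataset.

```r
# Alcohol and density for the first six wines in the head() output.
wine <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1),
  density = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956, 0.9951)
)

# Least-squares line: density as a linear function of alcohol.
fit <- lm(density ~ alcohol, data = wine)

# A negative slope means density falls as alcohol content rises.
slope <- coef(fit)[["alcohol"]]
slope
```

In a ggplot2 scatter plot, a layer such as geom_smooth(method = "lm") draws this kind of line; whether that layer was used for the plot described here is an assumption.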

Plot Three

Description Three

This plot shows that density and residual sugar are strongly correlated. Fitting a line makes the relationship easier to see.

Reflection

I started the exploration with a first look at the dataset using the head and summary functions, and then plotted histograms and frequency polygons to understand the distributions of my feature variables. I knew little about wine, so I read articles to understand what each feature variable means. This helped me create the new feature variable bound sulfur dioxide and gave me the idea of finding its ratio with free sulfur dioxide. I then moved on to the bivariate section to understand the correlations among the feature variables. From the articles and the correlogram, I had the intuition that density, alcohol, residual sugar, and acidity levels are the most important features for quality. Scatterplots helped me visualize the findings and made the observations clear.

Alcoholic fermentation produces sulfur dioxide, yet total sulfur dioxide decreases with increasing alcohol content, which was quite surprising to me. Data on the price of each wine, its terroir, and its year of production would have given more insight, because chloride concentration is influenced by terroir and grape type. I have also read that wines get better with age, and price would have helped me distinguish between low and high quality wines.

Reference